Definitions:
*Cancer diagnosis information comes from HES, coded using ICD-10 and ICD-9 codes. Additional cancer information at baseline comes from self-reported cancers.
Composite scores were computed for physical activity, total meat intake, red meat intake, white meat intake and total fruit/veg intake.
Physical activity:
Total fruit/veg intake:
Total meat intake:
PCA was performed for numerical variables (physical activity and total fruit/veg intake). The first component was used as the composite score.
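A minimal sketch of the PCA-based composite score described above, using base R's `prcomp`. The data frame, its column names and the simulated values are assumptions for illustration only, and standardising the inputs before PCA is also an assumption rather than something stated in the analysis.

```r
# Hypothetical data: three numeric activity items driven by one latent trait.
set.seed(1)
n <- 200
latent <- rnorm(n)
activity <- data.frame(
  walking  = latent + rnorm(n, sd = 0.5),
  moderate = latent + rnorm(n, sd = 0.5),
  vigorous = latent + rnorm(n, sd = 0.5)
)

# Centre and scale each item (an assumption), then run PCA.
pca <- prcomp(activity, center = TRUE, scale. = TRUE)

# The first principal component is used as the composite score.
composite_score <- pca$x[, 1]

# Share of total variance captured by the first component.
var_explained <- pca$sdev[1]^2 / sum(pca$sdev^2)
```

Checking `var_explained` is a quick way to judge how much information a single composite score retains.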
Multiple Correspondence Analysis (MCA) was performed for categorical variables (meat intake). The first component was used as the composite score.
Student's t-test was used to compare continuous variables and the chi-squared test to compare categorical variables. There is a separate table for biomarkers, and a further table for the supplementary material containing variables such as the composite scores described above. P values below 0.001 are shown as <0.001 rather than printed exactly, and p values below the Bonferroni threshold are in bold.
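The testing and reporting rules above can be sketched in base R as follows. The simulated data, the variable names, the 0.05 family-wise significance level and the helper `format_p` are all assumptions for illustration; they are not taken from the analysis code.

```r
set.seed(2)
# Hypothetical data: case/control flag, one continuous and one categorical variable.
dat <- data.frame(
  case = rep(c(0, 1), each = 100),
  age  = c(rnorm(100, 56, 8), rnorm(100, 62, 6)),
  sex  = factor(c(
    sample(c("F", "M"), 100, replace = TRUE, prob = c(0.55, 0.45)),
    sample(c("F", "M"), 100, replace = TRUE, prob = c(0.25, 0.75))
  ))
)

# Student's t-test (equal variances) for the continuous variable.
p_age <- t.test(age ~ case, data = dat, var.equal = TRUE)$p.value
# Chi-squared test for the categorical variable.
p_sex <- chisq.test(table(dat$sex, dat$case))$p.value

p_values <- c(age = p_age, sex = p_sex)

# Bonferroni threshold: assumed alpha of 0.05 divided by the number of tests.
bonferroni <- 0.05 / length(p_values)

# Hypothetical helper: report small p values as "<0.001".
format_p <- function(p) {
  ifelse(p < 0.001, "<0.001", formatC(p, format = "f", digits = 3))
}

summary_table <- data.frame(
  variable         = names(p_values),
  p                = format_p(p_values),
  below_bonferroni = p_values < bonferroni  # these rows would be shown in bold
)
```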
Table 1. Characteristics by case/control status, with P values for each comparison
Characteristic | N | Control (n=449,042)¹ | Lung (n=1,987)¹ | Bladder (n=1,724)¹ | P value: Control vs. Lung² | P value: Control vs. Bladder² |
Age | 452,753 | 56.12 (8.11) | 61.62 (5.86) | 61.56 (6.14) | <0.001 | <0.001 |
Sex | 452,753 | <0.001 | <0.001 | |||
Female | 240,963 (54%) | 969 (49%) | 415 (24%) | |||
Male | 208,079 (46%) | 1,018 (51%) | 1,309 (76%) | |||
Ethnicity | 450,201 | <0.001 | <0.001 | |||
White | 421,004 (94%) | 1,916 (97%) | 1,661 (97%) | |||
Non-white | 25,510 (5.7%) | 58 (2.9%) | 52 (3.0%) | |||
Townsend deprivation index | 452,196 | -1.30 (3.09) | 0.08 (3.64) | -1.33 (3.07) | <0.001 | 0.7 |
Current employment status | 447,508 | <0.001 | <0.001 | |||
Employed | 264,840 (60%) | 658 (33%) | 717 (42%) | |||
Active | 27,098 (6.1%) | 85 (4.3%) | 84 (4.9%) | |||
Retired | 130,946 (30%) | 1,014 (52%) | 816 (48%) | |||
Unable to work | 13,479 (3.0%) | 154 (7.8%) | 73 (4.3%) | |||
Unemployed | 7,470 (1.7%) | 54 (2.7%) | 20 (1.2%) | |||
Education | 443,599 | <0.001 | <0.001 | |||
Intermediate | 220,578 (50%) | 877 (46%) | 817 (48%) | |||
Low | 73,643 (17%) | 737 (38%) | 446 (26%) | |||
High | 145,764 (33%) | 307 (16%) | 430 (25%) | |||
Type of accommodation lived in | 451,473 | <0.001 | 0.2 | |||
House | 400,409 (89%) | 1,622 (82%) | 1,538 (89%) | |||
Flat | 44,537 (9.9%) | 321 (16%) | 165 (9.6%) | |||
Other | 2,824 (0.6%) | 40 (2.0%) | 17 (1.0%) | |||
Own or rent accommodation lived in | 447,450 | <0.001 | 0.5 | |||
Own | 395,569 (89%) | 1,487 (76%) | 1,507 (88%) | |||
Rent | 41,245 (9.3%) | 442 (23%) | 172 (10%) | |||
Other | 6,982 (1.6%) | 21 (1.1%) | 25 (1.5%) | |||
Number of people in household | 448,693 | <0.001 | <0.001 | |||
2 | 203,994 (46%) | 1,005 (52%) | 970 (57%) | |||
1 | 81,453 (18%) | 544 (28%) | 342 (20%) | |||
3-4 | 134,223 (30%) | 349 (18%) | 355 (21%) | |||
≥5 | 25,369 (5.7%) | 51 (2.6%) | 38 (2.2%) | |||
Average total household income | 384,236 | <0.001 | <0.001 | |||
18,000-30,999 | 95,722 (25%) | 458 (29%) | 430 (29%) | |||
<18,000 | 84,060 (22%) | 726 (46%) | 441 (30%) | |||
31,000-51,999 | 100,586 (26%) | 255 (16%) | 350 (24%) | |||
>52,000 | 100,806 (26%) | 146 (9.2%) | 256 (17%) | |||
BMI | 449,917 | <0.001 | <0.001 | |||
Normal | 145,007 (32%) | 609 (31%) | 394 (23%) | |||
Underweight | 2,253 (0.5%) | 22 (1.1%) | 8 (0.5%) | |||
Pre-obesity | 189,924 (43%) | 814 (41%) | 811 (47%) | |||
Obesity class I | 78,095 (18%) | 389 (20%) | 358 (21%) | |||
Obesity class II | 22,314 (5.0%) | 85 (4.3%) | 109 (6.4%) | |||
Obesity class III | 8,654 (1.9%) | 43 (2.2%) | 28 (1.6%) | |||
Sleep per 24 hours (hours) | 448,951 | <0.001 | 0.017 | |||
7-8 | 301,971 (68%) | 1,175 (60%) | 1,114 (65%) | |||
≤6 | 110,107 (25%) | 561 (29%) | 447 (26%) | |||
≥9 | 33,204 (7.5%) | 219 (11%) | 153 (8.9%) | |||
Number of days per week spent 10+ mins walking | 445,131 | <0.001 | 0.001 | |||
0 | 10,945 (2.5%) | 96 (5.0%) | 67 (4.0%) | |||
1 | 12,271 (2.8%) | 54 (2.8%) | 36 (2.1%) | |||
2 | 27,042 (6.1%) | 101 (5.2%) | 90 (5.3%) | |||
3 | 35,151 (8.0%) | 138 (7.2%) | 126 (7.4%) | |||
4 | 35,713 (8.1%) | 146 (7.6%) | 137 (8.1%) | |||
5 | 71,871 (16%) | 253 (13%) | 250 (15%) | |||
6 | 44,872 (10%) | 176 (9.1%) | 173 (10%) | |||
7 | 203,647 (46%) | 963 (50%) | 813 (48%) | |||
Number of days per week spent 10+ mins doing moderate exercise | 428,501 | <0.001 | 0.3 | |||
0 | 54,268 (13%) | 300 (17%) | 220 (14%) | |||
1 | 34,769 (8.2%) | 89 (4.9%) | 134 (8.3%) | |||
2 | 62,587 (15%) | 219 (12%) | 233 (14%) | |||
3 | 64,102 (15%) | 257 (14%) | 215 (13%) | |||
4 | 42,246 (9.9%) | 165 (9.1%) | 158 (9.8%) | |||
5 | 64,163 (15%) | 266 (15%) | 233 (14%) | |||
6 | 23,733 (5.6%) | 105 (5.8%) | 102 (6.3%) | |||
7 | 79,210 (19%) | 404 (22%) | 323 (20%) | |||
Number of days per week spent 10+ mins doing vigorous exercise | 428,191 | <0.001 | <0.001 | |||
0 | 156,834 (37%) | 921 (52%) | 666 (41%) | |||
1 | 60,248 (14%) | 185 (10%) | 213 (13%) | |||
2 | 67,527 (16%) | 199 (11%) | 251 (15%) | |||
3 | 59,155 (14%) | 167 (9.4%) | 164 (10%) | |||
4 | 27,802 (6.5%) | 85 (4.8%) | 114 (7.0%) | |||
5 | 29,495 (6.9%) | 106 (6.0%) | 108 (6.7%) | |||
6 | 8,609 (2.0%) | 35 (2.0%) | 33 (2.0%) | |||
7 | 15,122 (3.6%) | 80 (4.5%) | 72 (4.4%) | |||
Processed meat intake | 450,677 | <0.001 | <0.001 | |||
Never | 41,791 (9.3%) | 151 (7.6%) | 100 (5.8%) | |||
Less than once a week | 135,473 (30%) | 510 (26%) | 463 (27%) | |||
Once a week | 130,337 (29%) | 568 (29%) | 532 (31%) | |||
More than once a week | 139,384 (31%) | 747 (38%) | 621 (36%) | |||
Oily fish intake | 448,986 | <0.001 | 0.029 | |||
Never | 49,715 (11%) | 265 (13%) | 156 (9.2%) | |||
Less than once a week | 148,546 (33%) | 614 (31%) | 563 (33%) | |||
Once a week | 167,554 (38%) | 700 (36%) | 649 (38%) | |||
More than once a week | 79,501 (18%) | 390 (20%) | 333 (20%) | |||
Non oily fish intake | 449,337 | 0.2 | 0.012 | |||
Never | 21,314 (4.8%) | 101 (5.1%) | 73 (4.3%) | |||
Less than once a week | 130,302 (29%) | 534 (27%) | 454 (27%) | |||
Once a week | 221,148 (50%) | 999 (51%) | 915 (54%) | |||
More than once a week | 72,897 (16%) | 333 (17%) | 267 (16%) | |||
Poultry intake | 450,839 | <0.001 | <0.001 | |||
Never | 23,122 (5.2%) | 92 (4.7%) | 58 (3.4%) | |||
Less than once a week | 47,778 (11%) | 267 (14%) | 205 (12%) | |||
Once a week | 159,828 (36%) | 780 (40%) | 698 (41%) | |||
More than once a week | 216,425 (48%) | 833 (42%) | 753 (44%) | |||
Pork intake | 448,824 | <0.001 | <0.001 | |||
Never | 77,785 (17%) | 310 (16%) | 220 (13%) | |||
Less than once a week | 252,816 (57%) | 1,019 (52%) | 937 (55%) | |||
Once a week | 98,570 (22%) | 523 (27%) | 478 (28%) | |||
More than once a week | 15,998 (3.6%) | 101 (5.2%) | 67 (3.9%) | |||
Beef intake | 449,710 | <0.001 | <0.001 | |||
Never | 50,060 (11%) | 182 (9.2%) | 142 (8.3%) | |||
Less than once a week | 202,886 (45%) | 830 (42%) | 733 (43%) | |||
Once a week | 141,714 (32%) | 663 (34%) | 628 (37%) | |||
More than once a week | 51,363 (12%) | 293 (15%) | 216 (13%) | |||
Lamb intake | 448,631 | <0.001 | <0.001 | |||
Never | 79,728 (18%) | 334 (17%) | 250 (15%) | |||
Less than once a week | 252,464 (57%) | 962 (49%) | 928 (54%) | |||
Once a week | 99,098 (22%) | 552 (28%) | 463 (27%) | |||
More than once a week | 13,690 (3.1%) | 100 (5.1%) | 62 (3.6%) | |||
Salt added to food | 451,698 | <0.001 | <0.001 | |||
Never/rarely | 249,063 (56%) | 883 (45%) | 883 (51%) | |||
Sometimes | 125,605 (28%) | 582 (29%) | 490 (28%) | |||
Usually | 51,740 (12%) | 308 (16%) | 248 (14%) | |||
Always | 21,588 (4.8%) | 208 (10%) | 100 (5.8%) | |||
Tea intake per day (cups) | 451,657 | <0.001 | 0.010 | |||
0 | 65,823 (15%) | 375 (19%) | 236 (14%) | |||
≥1 | 52,145 (12%) | 192 (9.7%) | 206 (12%) | |||
2-3 | 131,685 (29%) | 431 (22%) | 455 (26%) | |||
≥4 | 198,305 (44%) | 981 (50%) | 823 (48%) | |||
Coffee intake per day (cups) | 451,534 | <0.001 | 0.2 | |||
0 | 99,520 (22%) | 484 (24%) | 351 (20%) | |||
≥1 | 121,637 (27%) | 401 (20%) | 469 (27%) | |||
2-3 | 138,412 (31%) | 498 (25%) | 535 (31%) | |||
≥4 | 88,265 (20%) | 597 (30%) | 365 (21%) | |||
Water intake per day (glasses) | 451,656 | <0.001 | <0.001 | |||
0 | 36,002 (8.0%) | 254 (13%) | 190 (11%) | |||
≥1 | 110,002 (25%) | 504 (25%) | 469 (27%) | |||
2-3 | 168,399 (38%) | 746 (38%) | 681 (40%) | |||
≥4 | 133,554 (30%) | 475 (24%) | 380 (22%) | |||
Alcohol intake frequency | 451,356 | <0.001 | <0.001 | |||
Never/rarely | 87,068 (19%) | 513 (26%) | 274 (16%) | |||
Occasionally | 166,211 (37%) | 637 (32%) | 577 (34%) | |||
Regularly | 194,383 (43%) | 825 (42%) | 868 (50%) | |||
Smoking status | 446,624 | <0.001 | <0.001 | |||
Never smoker - No smoker in household | 224,753 (51%) | 252 (13%) | 529 (31%) | |||
Never smoker - Yes, smoker in household | 21,466 (4.8%) | 24 (1.2%) | 43 (2.5%) | |||
Previous smoker - No smoker in household | 133,407 (30%) | 776 (40%) | 726 (43%) | |||
Previous smoker - Yes, smoker in household | 17,073 (3.9%) | 94 (4.8%) | 98 (5.8%) | |||
Current smoker | 46,281 (10%) | 800 (41%) | 302 (18%) | |||
Maternal smoking around birth | 390,442 | 112,784 (29%) | 615 (38%) | 444 (30%) | <0.001 | 0.5 |
NO2 (µg/m3) | 446,097 | 26.73 (7.58) | 27.95 (7.58) | 26.45 (7.68) | <0.001 | 0.13 |
NOx (µg/m3) | 446,097 | 44.14 (15.52) | 46.71 (16.02) | 44.28 (15.97) | <0.001 | 0.7 |
PM10 (µg/m3) | 415,390 | 16.24 (1.90) | 16.40 (1.87) | 16.16 (1.89) | <0.001 | 0.071 |
PM2.5 (absorbance/m) | 415,390 | 1.19 (0.27) | 1.21 (0.28) | 1.17 (0.27) | <0.001 | 0.017 |
PM2.5 (µg/m3) | 415,390 | 9.99 (1.06) | 10.20 (1.11) | 9.97 (1.05) | <0.001 | 0.4 |
PM2.5-10µm (µg/m3) | 415,390 | 6.43 (0.90) | 6.45 (0.90) | 6.41 (0.87) | 0.3 | 0.3 |
Number of medications | 451,933 | <0.001 | <0.001 | |||
0 | 127,761 (29%) | 293 (15%) | 351 (20%) | |||
1 | 85,684 (19%) | 275 (14%) | 265 (15%) | |||
>1 | 234,789 (52%) | 1,413 (71%) | 1,102 (64%) | |||
Parental history of COPD | 446,744 | 63,000 (14%) | 428 (22%) | 303 (18%) | <0.001 | <0.001 |
Parental history of diabetes | 446,744 | 77,822 (18%) | 266 (14%) | 269 (16%) | <0.001 | 0.074 |
Parental history of hypertension | 446,744 | 185,376 (42%) | 587 (30%) | 584 (34%) | <0.001 | <0.001 |
Parental history of stroke | 446,744 | 109,584 (25%) | 504 (26%) | 448 (26%) | 0.15 | 0.11 |
Parental history of heart disease | 446,744 | 177,952 (40%) | 780 (41%) | 712 (42%) | 0.8 | 0.12 |
Parental history of breast cancer | 442,332 | 32,279 (7.4%) | 121 (6.4%) | 128 (7.7%) | 0.11 | 0.7 |
Parental history of bowel cancer | 442,332 | 41,372 (9.4%) | 192 (10%) | 175 (10%) | 0.3 | 0.2 |
Parental history of lung cancer | 442,332 | 48,626 (11%) | 363 (19%) | 202 (12%) | <0.001 | 0.2 |
Parental history of prostate cancer | 442,332 | 29,271 (6.7%) | 88 (4.6%) | 98 (5.9%) | <0.001 | 0.2 |
Cardiovascular disease | 452,753 | 51,508 (11%) | 508 (26%) | 362 (21%) | <0.001 | <0.001 |
Hypertension | 452,753 | 120,527 (27%) | 790 (40%) | 684 (40%) | <0.001 | <0.001 |
Diabetes | 452,753 | 22,456 (5.0%) | 181 (9.1%) | 192 (11%) | <0.001 | <0.001 |
Respiratory disease | 452,753 | 72,417 (16%) | 558 (28%) | 300 (17%) | <0.001 | 0.2 |
Autoimmune disease | 452,753 | 49,706 (11%) | 282 (14%) | 189 (11%) | <0.001 | >0.9 |
¹ Mean (SD); n (%)
² Student's t-test for continuous variables; chi-squared test for categorical variables
Continuous variables were scaled prior to running logistic regression models to help visualize confidence intervals.
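As a rough illustration of the scaling step, the sketch below standardises continuous predictors before fitting a logistic regression, so that each odds ratio is per one standard deviation and the intervals sit on a comparable scale. The simulated data, the variable names and the use of Wald 95% intervals are assumptions, not the models actually run.

```r
set.seed(3)
n <- 500
# Hypothetical continuous exposures and a binary outcome.
dat <- data.frame(age = rnorm(n, 56, 8), no2 = rnorm(n, 27, 7.5))
dat$case <- rbinom(n, 1, plogis(-1 + 0.05 * (dat$age - 56) + 0.02 * (dat$no2 - 27)))

# Standardise continuous variables to mean 0 and SD 1.
dat_scaled <- dat
continuous <- c("age", "no2")
dat_scaled[continuous] <- lapply(dat[continuous], function(x) as.numeric(scale(x)))

fit <- glm(case ~ age + no2, data = dat_scaled, family = binomial())

# Odds ratios per 1 SD, with Wald 95% confidence intervals (an assumption).
est <- coef(summary(fit))[-1, , drop = FALSE]
or_table <- data.frame(
  term  = rownames(est),
  or    = exp(est[, "Estimate"]),
  lower = exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"]),
  upper = exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"])
)
```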
Manhattan
Significant for both models:
Sociodemographic:
Health risk:
Environment:
Medical:
Biomarkers:
Manhattan
Significant for both models: diabetes, HDL cholesterol.
P Values
A snapshot of all relevant p values. In this limited graph, we see that diabetes is significant for both models when not adjusted for smoking.
P Values
Diabetes, HDL cholesterol, and education:
P Values
All variables significant for lung cancer are marked. We see a number are significant for lung cancer and not bladder cancer.
P Values
Adjusted for smoking, only HDL cholesterol is still significant. Diabetes is no longer significant for both models.
P Values
We see that diabetes is significant for bladder cancer and not lung when adjusted for smoking.
P Values
Here there are far fewer significant markers; notably, diabetes and education are less significant. The scale of the graph is also much smaller, meaning that some of the effect has been removed by adjusting for smoking. The significance of the deprivation indicators has halved.
Forest
Some ORs go off the scale here, or their CIs cannot be seen, because the width of the confidence intervals varies widely. I tried free_y; it was a disaster.
The ORs that go off scale are connected to smoking status.
Forest Zoom out
This forest plot is zoomed out. We lose information visually on the CIs of the ORs, but we can still see the sharp contrast of the smoking variables in the middle, and how sociodemographic variables are reduced in effect when we control for smoking.
Forest Plotrix
Here I’m just playing around, overlaying all of the models on one graph. What we can see is that the models that cluster together are usually the ones adjusted for smoking, rather than the ones for the same cancer type.
-> How can we investigate this further?
Forest Plotrix
This looks terrible and I’m stuck on figuring out how to plot vertical confidence intervals.
-> Any advice?
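One possible approach to vertical confidence intervals, sketched with base graphics only: `arrows()` with `angle = 90` and `code = 3` draws a flat-capped bar from the lower to the upper bound. The odds ratios below are invented placeholders, and the log-scaled y axis is a suggestion for keeping very large ORs on the plot, not something already used in the deck.

```r
# Hypothetical odds ratios with 95% CI bounds (placeholder values).
forest <- data.frame(
  term  = c("term_a", "term_b", "term_c"),
  or    = c(1.8, 1.1, 9.5),
  lower = c(1.6, 0.9, 8.2),
  upper = c(2.0, 1.3, 11.0)
)
x <- seq_len(nrow(forest))

# Points for the ORs on a log-scaled y axis.
plot(x, forest$or,
     log = "y", ylim = range(forest$lower, forest$upper),
     xaxt = "n", pch = 19, xlab = "", ylab = "Odds ratio (log scale)")
axis(1, at = x, labels = forest$term)

# Vertical CI bars with flat caps at both ends.
arrows(x0 = x, y0 = forest$lower, x1 = x, y1 = forest$upper,
       angle = 90, code = 3, length = 0.05)

# Reference line at OR = 1.
abline(h = 1, lty = 2)
```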
What I did:
Age at Diagnosis Analysis
Age at Diagnosis Analysis
Time to Diagnosis Analysis
Time to Diagnosis Analysis
Models:
Denoised using linear regression and logistic regression for continuous and categorical variables, respectively. One-hot encoding used for categorical variables with more than 2 levels.
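A minimal sketch of what this step could look like, assuming that "denoising" means regressing each variable on a set of confounders and keeping the residuals. The confounders (age and sex here), the choice of response residuals for the logistic model, and all variable names and simulated values are assumptions for illustration.

```r
set.seed(4)
n <- 300
# Hypothetical confounders and variables to denoise.
dat <- data.frame(
  age       = rnorm(n, 56, 8),
  sex       = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  education = factor(sample(c("Low", "Intermediate", "High"), n, replace = TRUE))
)
dat$no2      <- 20 + 0.1 * dat$age + rnorm(n)
dat$diabetes <- rbinom(n, 1, plogis(-3 + 0.03 * dat$age))

# Continuous variable: residuals from a linear regression on the confounders.
no2_denoised <- residuals(lm(no2 ~ age + sex, data = dat))

# Binary variable: residuals (observed minus fitted probability)
# from a logistic regression on the confounders.
diabetes_fit <- glm(diabetes ~ age + sex, data = dat, family = binomial())
diabetes_denoised <- residuals(diabetes_fit, type = "response")

# Categorical variable with more than two levels: one indicator column per level.
education_onehot <- model.matrix(~ education - 1, data = dat)
```

Each column of `education_onehot` is then a binary variable that can be handled like `diabetes`.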
Additionally, models with forced confounders were run to check for any bias in the denoised datasets.
Four models run for each outcome (lung/bladder cancer):
Lung: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)
Bladder: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)
Lung: Mean Odds Ratio
Bladder: Mean Odds Ratio
Dashed red line: \(\text{threshold} = \max(\hat{\pi}_{\text{base}}, \hat{\pi}_{\text{adjusted}})\)
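To make the threshold rule concrete, the sketch below applies the larger of the two calibrated values to both models' selection proportions. Every number and variable name here is an invented placeholder, not a result from the analysis, and treating "at or above the threshold" as stably selected is an assumption.

```r
# Hypothetical calibrated thresholds for the base and smoking-adjusted models.
pi_base     <- 0.90
pi_adjusted <- 0.86

# Common threshold: the stricter (larger) of the two.
threshold <- max(pi_base, pi_adjusted)

# Hypothetical selection proportions per variable in each model.
selprop_base     <- c(var_a = 0.97, var_b = 0.92, var_c = 0.88, var_d = 0.35)
selprop_adjusted <- c(var_a = 0.95, var_b = 0.62, var_c = 0.91, var_d = 0.30)

# Variables at or above the threshold count as stably selected.
stable_base     <- names(selprop_base)[selprop_base >= threshold]
stable_adjusted <- names(selprop_adjusted)[selprop_adjusted >= threshold]

# Variables stably selected in both models.
stable_both <- intersect(stable_base, stable_adjusted)
```

Using a single threshold for both models keeps the two selection proportion plots directly comparable.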
Lung
Lung: Selection Proportion
Bladder
Bladder: Selection Proportion
Checking consistency in sign of the beta coefficients for the variables with high selprop
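One way to run this check, sketched under the assumption that the beta coefficients from each subsampling iteration are stored in a matrix with one column per variable and zeros where a variable was not selected. The matrix values and the helper `sign_consistency` are hypothetical.

```r
# Hypothetical coefficients: rows are subsampling iterations, columns are variables.
# A zero means the variable was not selected in that iteration.
beta <- cbind(
  var_a = c(-0.3, -0.2, 0.0, -0.4, -0.1),
  var_b = c( 0.2, -0.1, 0.3,  0.0,  0.1)
)

# Share of non-zero coefficients that carry the majority sign.
sign_consistency <- function(b) {
  nonzero <- b[b != 0]
  if (length(nonzero) == 0) return(NA_real_)
  max(mean(nonzero > 0), mean(nonzero < 0))
}

consistency <- apply(beta, 2, sign_consistency)
```

A value of 1 means the direction of effect never flipped across iterations; values close to 0.5 suggest the sign is unstable even when the variable is selected often.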
Lung: Base model (left) and Adjusted model (right) AUC
Bladder: Base model (left) and Adjusted model (right) AUC
Lung: Base model (left) and Adjusted model (right) AUC
Bladder: Base model (left) and Adjusted model (right) AUC
Lung: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)
Bladder: Calibration of lambda and pi (Top = base model; Bottom = model adjusted for smoking)
Lung: Mean Odds Ratio
Bladder: Mean Odds Ratio
Lung: Selection Proportion
Bladder: Selection Proportion
Lung: Base model (left) and Adjusted model (right) AUC
Bladder: Base model (left) and Adjusted model (right) AUC
Lung: Base model (left) and Adjusted model (right) AUC
Bladder: Base model (left) and Adjusted model (right) AUC
Stability analyses for sPLS on lung adjusted for age, sex and BMI
Stability analyses for sPLS on lung adjusted for age, sex, BMI and smoking
Stability selection for sPLS on lung adjusted for age, sex, and BMI
Selection proportion for sPLS on lung adjusted for age, sex, and BMI
Use results from stability selection for sPLS, lambda = 36
Loading coefficients from sPLS on lung adjusted for age, sex, and BMI
Stability selection for sPLS on lung adjusted for age, sex, BMI and smoking
Selection proportion for sPLS on lung adjusted for age, sex, BMI and smoking
Use results from stability selection for sPLS, lambda = 38
Loading coefficients from sPLS on lung adjusted for age, sex, BMI and smoking
Stability analyses for sPLS on bladder adjusted for age, sex and BMI
Stability analysis for sPLS on bladder adjusted for age, sex, BMI and smoking
Stability selection for sPLS on bladder adjusted for age, sex, and BMI
Selection proportion for sPLS on bladder adjusted for age, sex, and BMI
Use results from stability selection for sPLS, lambda = 22
Loading coefficients from sPLS on bladder adjusted for age, sex, and BMI
Stability selection for sPLS on bladder adjusted for age, sex, BMI and smoking
Selection proportion for sPLS on bladder adjusted for age, sex, BMI and smoking
Use results from stability selection for sPLS, lambda = 26
Loading coefficients from sPLS on bladder adjusted for age, sex, BMI and smoking
Lung cancer:
Bladder cancer:
Key points:
Questions
Lung cancer:
Bladder cancer:
Question
Report (Results section) outline:
Next steps are in bold